Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first, before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```shell
pip install -q datasets pandas
```

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
```
🔴 The `run_benchmark_serving` call is missing `--use-chat-template`, which every other MTP benchmark script in the repo (6 out of 6) includes. Without this flag, MTP acceptance rates are artificially high because raw text without chat-formatting special tokens is easier for the draft model to predict, producing misleading benchmark results. Add `--use-chat-template` after the `--result-dir` line to match the established pattern.
Extended reasoning...
What the bug is
The new `qwen3.5_fp8_h200_mtp.sh` benchmark script omits `--use-chat-template` from its `run_benchmark_serving` call (lines 69-79). This flag is present in every other MTP benchmark script in the repository.
Evidence of the pattern
All 6 existing single-node MTP benchmark scripts include `--use-chat-template`:

- `dsr1_fp8_b200_mtp.sh` (line 108)
- `dsr1_fp4_b200_trt_mtp.sh` (line 133)
- `dsr1_fp8_b200_trt_mtp.sh` (line 143)
- `dsr1_fp8_h200_trt_mtp.sh` (line 115)
- `dsr1_fp4_mi355x_atom_mtp.sh` (line 71)
- `dsr1_fp8_mi355x_atom_mtp.sh` (line 70)
Additionally, the multi-node AMD utility (`bench.sh:60`) adds this flag generically for ALL MTP benchmarks via `[ "$IS_MTP" = "true" ] && echo "--use-chat-template"`, confirming this is a model-agnostic requirement, not DeepSeek-specific.
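As an illustrative sketch of that flag-injection pattern (this is not the actual `bench.sh` source; the surrounding assembly is hypothetical), the MTP-only flag is emitted by a conditional and spliced into the command line, so non-MTP runs get an empty string instead:

```shell
#!/bin/sh
# Sketch of the bench.sh pattern: emit "--use-chat-template" only for
# MTP benchmarks; the command substitution yields "" otherwise.
IS_MTP="true"

MTP_FLAGS="$([ "$IS_MTP" = "true" ] && echo "--use-chat-template")"

# A single invocation line then covers both MTP and non-MTP cases.
echo run_benchmark_serving --backend vllm $MTP_FLAGS
```

Leaving `$MTP_FLAGS` unquoted here is deliberate, so an empty value disappears from the argument list rather than passing an empty-string argument.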
Root cause
The script was likely copied from the non-MTP `qwen3.5_fp8_h200.sh` (which correctly omits the flag since MTP acceptance rates are irrelevant without speculative decoding) but failed to add `--use-chat-template` as all other MTP scripts do.
Step-by-step proof of impact
1. The benchmark runs with EAGLE speculative decoding enabled (`--speculative-algorithm EAGLE`, lines 55-58).
2. `run_benchmark_serving` sends prompts to the server. Without `--use-chat-template`, raw text is sent without chat-formatting special tokens.
3. The draft model finds raw text easier to predict than properly formatted chat messages (which contain special tokens like `<|im_start|>`, `<|im_end|>`, etc.).
4. This results in artificially higher MTP acceptance rates.
5. The benchmark reports misleadingly optimistic throughput numbers that won't reflect real-world chat serving performance.
How to fix
Add `--use-chat-template` to the `run_benchmark_serving` call, e.g. after `--result-dir /workspace/`. This is a one-line addition that aligns the script with every other MTP benchmark in the repository.
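For concreteness, a minimal sketch of the corrected invocation follows. The variable values are placeholders (the real script sets them elsewhere), and `run_benchmark_serving` is stubbed out so the sketch is self-contained; only the final `--use-chat-template` line is the actual change being requested.

```shell
#!/bin/sh
# Placeholder values; the real script derives these from its config.
MODEL="Qwen3.5-placeholder"; PORT=8000; ISL=1024; OSL=1024
RANDOM_RANGE_RATIO=0.8; CONC=64

# Stub for illustration only: prints each argument it receives.
run_benchmark_serving() { printf '%s\n' "$@"; }

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-dir /workspace/ \
    --use-chat-template
```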
```yaml
- qwen3.5-fp8-h200-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```
🟡 Nit: The `pr-link` for the new `qwen3.5-fp8-h200-sglang-mtp` entry uses a placeholder `/pull/XXX` instead of `/pull/921`. Please update before merging.
Extended reasoning...
Bug Description
The new perf-changelog entry added at line 987 for `qwen3.5-fp8-h200-sglang-mtp` uses a placeholder PR link:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

instead of the actual PR number:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```

Code Path
The diff adds a new changelog block at the end of `perf-changelog.yaml` (lines 982-987). Every other finalized entry in the file has a concrete PR number in its `pr-link` field, making this entry an outlier that needs updating.
Pre-existing Context
There are several other pre-existing `XXX` placeholders in the file (e.g., for `glm5-fp8-mi355x-sglang`, `dsr1-fp8-h200-sglang`, `minimaxm2.5-fp8-h200-vllm`, `qwen3.5-bf16-mi325x-sglang`, `qwen3.5-fp8-mi325x-sglang`). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.
Impact
The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.
Suggested Fix
Replace `XXX` with `921` on line 987:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```

Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before final merge. Flagging it here as a reminder.
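If the placeholder is still in place at merge time, a one-line `sed` fix could apply it (the file path is an assumption; adjust to wherever `perf-changelog.yaml` actually lives — the sketch below runs against a temp copy so it is self-contained):

```shell
#!/bin/sh
# Hypothetical fix command, demonstrated on a temp copy of the file.
# Assumption: the changelog entry's pr-link line matches "/pull/XXX".
printf 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n' > /tmp/perf-changelog.yaml
sed -i 's|/pull/XXX|/pull/921|' /tmp/perf-changelog.yaml
cat /tmp/perf-changelog.yaml
```

Note that `sed -i` with no backup suffix is the GNU form; BSD/macOS `sed` would need `sed -i ''` instead.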